In homework 2 you will fit many regression models. You are welcome to explore beyond what the question is asking you.
Please come see us; we are here to help.
The folder earnings has data from the Work, Family, and Well-Being Survey (Ross, 1990). You can find the codebook at http://www.stat.columbia.edu/~gelman/arm/examples/earnings/wfwcodebook.txt
library(foreign)  # read.dta() for Stata files
gelman_dir <- "http://www.stat.columbia.edu/~gelman/arm/examples/"
heights <- read.dta(paste0(gelman_dir, "earnings/heights.dta"))
Pull out the data on earnings, sex, height, and weight.
library(tidyverse)
map_dbl(heights, ~ sum(is.na(.x)))
## earn height1 height2 sex race hisp ed yearbn height
## 650 8 6 0 0 0 0 0 8
na<-which(!complete.cases(heights))
heights_withoutNa<-heights[-na,]
map_dbl(heights_withoutNa, ~ sum(is.na(.x)))
## earn height1 height2 sex race hisp ed yearbn height
## 0 0 0 0 0 0 0 0 0
zero_earn<-which(heights_withoutNa$earn==0)
heights_clean<-heights_withoutNa[-zero_earn,]
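Equivalently, the cleaning above can be sketched as a single dplyr pipeline. A minimal sketch on a toy data frame (hypothetical values, since the real heights data require a download):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the heights data (hypothetical values)
toy <- tibble::tibble(
  earn   = c(25000, 0, NA, 40000),
  height = c(66, 70, 64, NA)
)

# Drop rows with any NA, then drop zero earnings
toy_clean <- toy %>%
  drop_na() %>%
  filter(earn > 0)

toy_clean
```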
# For this model, we center height at its mean
center_height<-heights_clean$height-mean(heights_clean$height)
mod2<-lm(earn~center_height,data=heights_clean)
summary(mod2)
##
## Call:
## lm(formula = earn ~ center_height, data = heights_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30166 -11309 -3428 6527 172953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23154.8 546.4 42.376 <2e-16 ***
## center_height 1262.3 142.1 8.883 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18870 on 1190 degrees of freedom
## Multiple R-squared: 0.06218, Adjusted R-squared: 0.06139
## F-statistic: 78.9 on 1 and 1190 DF, p-value: < 2.2e-16
plot(mod2)
# Log-transform earn; zero earnings were already removed above, so log(earn) is defined and log(earn + 1) is unnecessary
mod3<-lm(log(earn)~height+sex+ed,data=heights_clean)
summary(mod3)
##
## Call:
## lm(formula = log(earn) ~ height + sex + ed, data = heights_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4496 -0.3379 0.1320 0.5151 2.1989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.252336 0.673976 12.244 < 2e-16 ***
## height 0.009061 0.008875 1.021 0.307
## sex -0.467729 0.068733 -6.805 1.6e-11 ***
## ed 0.117962 0.010057 11.730 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8343 on 1188 degrees of freedom
## Multiple R-squared: 0.1814, Adjusted R-squared: 0.1793
## F-statistic: 87.73 on 3 and 1188 DF, p-value: < 2.2e-16
plot(mod3)
# z-score transformation on height
height_z<-center_height/sd(heights_clean$height)
mod33<-lm(earn~height_z+sex+ed,data=heights_clean)
summary(mod33)
##
## Call:
## lm(formula = earn ~ height_z + sex + ed, data = heights_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40945 -10027 -2379 5763 158605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3418.4 3578.7 0.955 0.340
## height_z 706.0 715.7 0.986 0.324
## sex -10083.5 1440.9 -6.998 4.32e-12 ***
## ed 2638.5 210.8 12.515 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17490 on 1188 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1933
## F-statistic: 96.11 on 3 and 1188 DF, p-value: < 2.2e-16
plot(mod33)
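The manual z-score above (center, then divide by the sample SD) is exactly what base R's `scale()` does in one step; a toy check of the equivalence:

```r
# scale() centers and divides by the sample SD in one step;
# this toy vector confirms it matches the manual z-score above
x <- c(60, 65, 70, 75)
manual_z <- (x - mean(x)) / sd(x)
scaled_z <- as.numeric(scale(x))
all.equal(manual_z, scaled_z)  # TRUE
```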
# I prefer the log transformation of earn: a few very large earnings dominate the raw scale and distort both the plots and the linear fit.
# For the log model, the coefficients combine multiplicatively on the original scale:
# earn = exp(8.25) * exp(0.009)^height * exp(0.118)^ed / exp(0.468)^sex
# When height = 0, ed = 0, and sex = 0, earn = exp(8.25). Each additional unit of height multiplies earnings by exp(0.009); each additional year of education multiplies earnings by exp(0.118); a one-unit increase in sex divides earnings by exp(0.468).
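On the log scale, coefficients translate into multiplicative (percentage) effects. A quick computation with the coefficients copied from the mod3 summary above:

```r
# Coefficients copied from the mod3 summary above
b_height <- 0.009061
b_sex    <- -0.467729
b_ed     <- 0.117962

# Multiplicative effect on earn of one extra unit of each predictor
round(exp(c(height = b_height, sex = b_sex, ed = b_ed)), 3)
# exp(0.118) is about 1.125: one more year of education is associated with
# roughly 12.5% higher earnings, holding the other inputs fixed
```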
round(confint(mod2,level=0.95))
## 2.5 % 97.5 %
## (Intercept) 22083 24227
## center_height 984 1541
# The 95% CI for the intercept of mod2 is (22083, 24227): the expected earnings at the mean height. The 95% CI for the slope is (984, 1541).
round(confint(mod3,level=0.95))
## 2.5 % 97.5 %
## (Intercept) 7 10
## height 0 0
## sex -1 0
## ed 0 0
# The 95% CI for the intercept of mod3 is (6.93, 9.57) before rounding: the expected log(earn) when all predictors are 0. Rounding to whole numbers hides the other intervals; round(confint(mod3, level = 0.95), 2) would be more informative.
The folder pollution contains mortality rates and various environmental factors from 60 U.S. metropolitan areas from McDonald, G.C. and Schwing, R.C. (1973) ‘Instabilities of regression estimates relating air pollution to mortality’, Technometrics, vol.15, 463-482.
Variables, in order:
For this exercise we shall model mortality rate given nitric oxides, sulfur dioxide, and hydrocarbons as inputs. This model is an extreme oversimplification as it combines all sources of mortality and does not adjust for crucial factors such as age and smoking. We use it to illustrate log transformations in regression.
gelman_dir <- "http://www.stat.columbia.edu/~gelman/arm/examples/"
pollution <- read.dta (paste0(gelman_dir,"pollution/pollution.dta"))
library(ggplot2)
# mort is the outcome, so put it on the y axis
ggplot(data = pollution) + geom_point(mapping = aes(x = nox, y = mort))
mod1<-lm(mort~nox,data=pollution)
summary(mod1)
##
## Call:
## lm(formula = mort ~ nox, data = pollution)
##
## Residuals:
## Min 1Q Median 3Q Max
## -148.654 -43.710 1.751 41.663 172.211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 942.7115 9.0034 104.706 <2e-16 ***
## nox -0.1039 0.1758 -0.591 0.557
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.55 on 58 degrees of freedom
## Multiple R-squared: 0.005987, Adjusted R-squared: -0.01115
## F-statistic: 0.3494 on 1 and 58 DF, p-value: 0.5568
# The regression does not fit these data well: the slope is not significant and R-squared is near zero
plot(resid(mod1))
plot(mod1)
# Log-transform nox as log(nox + 1), since the scatter plot shows a few very large nox values
mod22<-lm(mort~log(nox+1),data=pollution)
summary(mod22)
##
## Call:
## lm(formula = mort ~ log(nox + 1), data = pollution)
##
## Residuals:
## Min 1Q Median 3Q Max
## -165.622 -29.433 8.987 36.534 166.293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 901.597 20.078 44.904 <2e-16 ***
## log(nox + 1) 15.661 7.474 2.096 0.0405 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.49 on 58 degrees of freedom
## Multiple R-squared: 0.07038, Adjusted R-squared: 0.05435
## F-statistic: 4.391 on 1 and 58 DF, p-value: 0.0405
plot(mod22)
# From the mod22 summary, mort = 901.6 + 15.661 * log(nox + 1)
# When nox = 0, the expected mort is 901.6. Each one-unit increase in log(nox + 1) is associated with a 15.661 increase in mort.
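Because the predictor enters as log(nox + 1), effects are most natural on a multiplicative scale. A quick computation using the slope from the mod22 summary:

```r
b <- 15.661  # slope on log(nox + 1), copied from the mod22 summary

# Doubling (nox + 1) adds b * log(2), about 10.9, to the expected mortality rate
round(b * log(2), 1)
```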
round(confint(mod22,level=0.99),2)
## 0.5 % 99.5 %
## (Intercept) 848.12 955.07
## log(nox + 1) -4.24 35.57
# The 99% CI for the slope is (-4.24, 35.57), which crosses 0; the slope is not significant at the 1% level.
# Log-transform both nox and hc, and center so2 at its mean
so2_center<-pollution$so2-mean(pollution$so2)
mod5<-lm(mort~log(nox)+so2_center+log(hc),data=pollution)
summary(mod5)
##
## Call:
## lm(formula = mort ~ log(nox) + so2_center + log(hc), data = pollution)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.262 -35.757 -0.413 37.602 171.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 957.8402 21.4379 44.680 <2e-16 ***
## log(nox) 56.0779 22.7132 2.469 0.0166 *
## so2_center 0.2638 0.1654 1.594 0.1165
## log(hc) -53.6761 20.1715 -2.661 0.0101 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.43 on 56 degrees of freedom
## Multiple R-squared: 0.2733, Adjusted R-squared: 0.2344
## F-statistic: 7.02 on 3 and 56 DF, p-value: 0.0004336
plot(mod5)
# The fitted model is mort = 957.84 + 56.078 * log(nox) + 0.264 * so2_center - 53.676 * log(hc)
# Interpretation of the coefficients: when nox = 1, so2 is at its mean, and hc = 1, the expected mort is 957.84. Each one-unit increase in log(nox) is associated with a 56.078 increase in mort; each one-unit increase in so2 above its mean with a 0.264 increase; each one-unit increase in log(hc) with a 53.676 decrease.
data_firsthalf<-pollution[1:30,]
so2_center<-data_firsthalf$so2-mean(data_firsthalf$so2)
mod6<-lm(mort~log(nox)+so2_center+log(hc),data=data_firsthalf)
PI <- predict(mod6 , interval = "prediction")
## Warning in predict.lm(mod6, interval = "prediction"): predictions on current data refer to _future_ responses
PI
## fit lwr upr
## 1 948.5196 838.2094 1058.830
## 2 944.6024 828.5043 1060.701
## 3 940.2922 827.7335 1052.851
## 4 931.5208 819.4004 1043.641
## 5 1005.7045 886.5626 1124.846
## 6 956.9330 841.0506 1072.815
## 7 955.7156 832.4293 1079.002
## 8 926.2348 815.3447 1037.125
## 9 939.6170 829.0159 1050.218
## 10 931.8108 821.3922 1042.229
## 11 932.6385 820.5771 1044.700
## 12 1031.5490 899.9133 1163.185
## 13 983.8440 870.1099 1097.578
## 14 950.3428 839.2196 1061.466
## 15 927.3887 813.3318 1041.446
## 16 926.5966 806.8878 1046.305
## 17 930.7057 820.0210 1041.390
## 18 933.4131 821.9097 1044.916
## 19 973.1574 861.2121 1085.103
## 20 924.5802 809.4939 1039.666
## 21 926.5966 806.8878 1046.305
## 22 927.4446 815.9921 1038.897
## 23 922.2764 807.0161 1037.537
## 24 925.0784 811.3795 1038.777
## 25 926.9072 814.7143 1039.100
## 26 936.0913 825.6355 1046.547
## 27 925.1508 814.1932 1036.108
## 28 936.8935 826.9262 1046.861
## 29 975.3800 845.4208 1105.339
## 30 1001.5165 883.9411 1119.092
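The same `predict()` call with `interval = "prediction"` can be exercised on new data as well as the fitted data; a minimal sketch on simulated data (toy data, not the pollution data):

```r
set.seed(42)
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x + rnorm(20)

fit <- lm(y ~ x, data = toy)

# 95% prediction interval for a single new observation at x = 10
pi_new <- predict(fit, newdata = data.frame(x = 10),
                  interval = "prediction", level = 0.95)
pi_new  # columns: fit, lwr, upr
```

Passing `newdata` avoids the "predictions on current data refer to _future_ responses" warning seen above, which R emits when a prediction interval is requested for the rows used to fit the model.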
library(faraway)  # the teengamb and sat data sets
data(teengamb)
?teengamb
center_verbal<-teengamb$verbal-mean(teengamb$verbal)
mod11<-lm(log(gamble+1)~log(income+1)+sex+status+center_verbal,data=teengamb)
summary(mod11)
##
## Call:
## lm(formula = log(gamble + 1) ~ log(income + 1) + sex + status +
## center_verbal, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5829 -0.5913 0.1162 0.8323 2.0818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.76617 0.98339 -0.779 0.44028
## log(income + 1) 1.25898 0.30774 4.091 0.00019 ***
## sex -1.03618 0.39263 -2.639 0.01161 *
## status 0.02622 0.01354 1.937 0.05946 .
## center_verbal -0.25497 0.10613 -2.402 0.02079 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.109 on 42 degrees of freedom
## Multiple R-squared: 0.4994, Adjusted R-squared: 0.4517
## F-statistic: 10.48 on 4 and 42 DF, p-value: 5.615e-06
plot(mod11)
round(confint(mod11,level=0.95),2)
## 2.5 % 97.5 %
## (Intercept) -2.75 1.22
## log(income + 1) 0.64 1.88
## sex -1.83 -0.24
## status 0.00 0.05
## center_verbal -0.47 -0.04
# The intervals for log(income + 1), sex, and center_verbal do not cross 0, so those coefficients are significant at the 5% level; the interval for status includes 0 (its p-value is 0.059), as does the intercept's.
mod_pred <- lm(log(gamble + 1) ~ log(income + 1) + sex + status + verbal,
               data = teengamb)
predict_mean <- predict(mod_pred,
                        newdata = data.frame(sex = 0,
                                             status = mean(teengamb$status),
                                             income = mean(teengamb$income),
                                             verbal = mean(teengamb$verbal)),
                        level = 0.95, interval = "prediction")
predict_max <- predict(mod_pred,
                       newdata = data.frame(sex = 0,
                                            status = max(teengamb$status),
                                            income = max(teengamb$income),
                                            verbal = max(teengamb$verbal)),
                       level = 0.95, interval = "prediction")
predict_mean-predict_max
## fit lwr upr
## 1 -1.241176 -0.9959947 -1.486358
# The PI for a male at the maximal predictor values is wider: the difference in upper bounds (-1.486) is larger in magnitude than the difference in lower bounds (-0.996).
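Prediction intervals widen as the inputs move away from the center of the data, which is why the maximal-value interval is wider. A toy illustration that compares the widths directly:

```r
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- 1 + 0.5 * toy$x + rnorm(50)

fit <- lm(y ~ x, data = toy)

# Helpers: prediction interval at a given x, and its width
pi_at <- function(x0) {
  predict(fit, newdata = data.frame(x = x0), interval = "prediction")
}
width <- function(pi) pi[, "upr"] - pi[, "lwr"]

# The interval at the mean of x is narrower than at the maximum of x
width(pi_at(mean(toy$x))) < width(pi_at(max(toy$x)))  # TRUE
```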
data(sat)
?sat
center_salary<-sat$salary-mean(sat$salary)
center_ratio<-sat$ratio-mean(sat$ratio)
mod111<-lm(total~center_salary+center_ratio+log(expend),data=sat)
summary(mod111)
##
## Call:
## lm(formula = total ~ center_salary + center_ratio + log(expend),
## data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -142.750 -46.921 1.093 47.136 122.776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 872.174 240.323 3.629 0.000711 ***
## center_salary -7.214 4.600 -1.568 0.123703
## center_ratio 4.692 6.770 0.693 0.491734
## log(expend) 53.510 137.063 0.390 0.698040
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.95 on 46 degrees of freedom
## Multiple R-squared: 0.2027, Adjusted R-squared: 0.1507
## F-statistic: 3.897 on 3 and 46 DF, p-value: 0.01457
plot(mod111)
# Interpretation of the coefficients: when center_salary = 0 (salary at its mean), center_ratio = 0 (ratio at its mean), and expend = 1, the expected total is 872.174. Each unit increase in center_salary is associated with a 7.214 decrease in total score; each unit increase in center_ratio with a 4.692 increase; each unit increase in log(expend) with a 53.51 increase.
round(confint(mod111,level=0.98),2)
## 1 % 99 %
## (Intercept) 292.95 1451.40
## center_salary -18.30 3.87
## center_ratio -11.62 21.01
## log(expend) -276.84 383.86
# All of the slope CIs cross 0, so none of the predictors is significant at the 2% level.
center_salary<-sat$salary-mean(sat$salary)
center_ratio<-sat$ratio-mean(sat$ratio)
mod333<-lm(total~center_salary+center_ratio+log(expend)+takers,data=sat)
plot(resid(mod333))
# The model with takers appears to explain the outcome better: its residuals scatter more evenly around zero.
For a study of congressional elections, you would like a measure of the relative amount of money raised by each of the two major-party candidates in each district. Suppose that you know the amount of money raised by each candidate; label these dollar values \(D_i\) and \(R_i\). You would like to combine these into a single variable that can be included as an input variable into a model predicting vote share for the Democrats.
Discuss the advantages and disadvantages of the following measures:
For observed pair of \(\mathrm{x}\) and \(\mathrm{y}\), we fit a simple regression model \[\mathrm{y}=\alpha + \beta \mathrm{x} + \mathrm{\epsilon}\] which results in estimates \(\hat{\alpha}=1\), \(\hat{\beta}=0.9\), \(SE(\hat{\beta})=0.03\), \(\hat{\sigma}=2\) and \(r=0.3\).
Suppose that the explanatory variable values in a regression are transformed according to the \(\mathrm{x}^{\star}=\mathrm{x}-10\) and that \(\mathrm{y}\) is regressed on \(\mathrm{x}^{\star}\). Without redoing the regression calculation in detail, find \(\hat{\alpha}^{\star}\), \(\hat{\beta}^{\star}\), \(\hat{\sigma}^{\star}\), and \(r^{\star}\). What happens to these quantities when \(\mathrm{x}^{\star}=10\mathrm{x}\) ? When \(\mathrm{x}^{\star}=10(\mathrm{x}-1)\)?
Now suppose that the response variable scores are transformed according to the formula \(\mathrm{y}^{\star\star}= \mathrm{y}+10\) and that \(\mathrm{y}^{\star\star}\) is regressed on \(\mathrm{x}\). Without redoing the regression calculation in detail, find \(\hat{\alpha}^{\star\star}\), \(\hat{\beta}^{\star\star}\), \(\hat{\sigma}^{\star\star}\), and \(r^{\star\star}\). What happens to these quantities when \(\mathrm{y}^{\star\star}=5\mathrm{y}\) ? When \(\mathrm{y}^{\star\star}=5(\mathrm{y}+2)\)?
In general, how are the results of a simple regression analysis affected by linear transformations of \(\mathrm{y}\) and \(\mathrm{x}\)?
Suppose that the explanatory variable values in a regression are transformed according to the \(\mathrm{x}^{\star}=10(\mathrm{x}-1)\) and that \(\mathrm{y}\) is regressed on \(\mathrm{x}^{\star}\). Without redoing the regression calculation in detail, find \(SE(\hat{\beta}^{\star})\) and \(t^{\star}_0= \hat{\beta}^{\star}/SE(\hat{\beta}^{\star})\).
Now suppose that the response variable scores are transformed according to the formula \(\mathrm{y}^{\star\star}=5(\mathrm{y}+2)\) and that \(\mathrm{y}^{\star\star}\) is regressed on \(\mathrm{x}\). Without redoing the regression calculation in detail, find \(SE(\hat{\beta}^{\star\star})\) and \(t^{\star\star}_0= \hat{\beta}^{\star\star}/SE(\hat{\beta}^{\star\star})\).
In general, how are the hypothesis tests and confidence intervals for \(\beta\) affected by linear transformations of \(\mathrm{y}\) and \(\mathrm{x}\)?
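The invariance results behind these questions (rescaling x by a factor a divides the slope and its SE by a while leaving \(\hat{\sigma}\), \(r\), and \(t_0\) unchanged; rescaling y by a factor c multiplies \(\hat{\alpha}\), \(\hat{\beta}\), \(\hat{\sigma}\), and \(SE(\hat{\beta})\) by c while leaving \(r\) and \(t_0\) unchanged) can be checked numerically on simulated data; a sketch, not tied to the estimates in the problem:

```r
set.seed(7)
x <- rnorm(100, mean = 5)
y <- 1 + 0.9 * x + rnorm(100, sd = 2)

f0 <- lm(y ~ x)
f1 <- lm(y ~ I(10 * (x - 1)))   # x* = 10(x - 1)
f2 <- lm(I(5 * (y + 2)) ~ x)    # y** = 5(y + 2)

# Slope scales by 1/10 under x* = 10(x - 1); the t statistic is unchanged
c(coef(f0)[["x"]], 10 * coef(f1)[[2]])

# Slope and residual SD both scale by 5 under y** = 5(y + 2); r and t unchanged
c(5 * coef(f0)[["x"]], coef(f2)[["x"]])
c(5 * sigma(f0), sigma(f2))
```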
HW2 by-hand part
If you have any comments about the homework or the class, please write your feedback here. We would love to hear your opinions.